feat(source/dest huggingface): Improve the HF Datasets source, add HF Buckets source, add destinations #81357
Quentin Lhoest (lhoestq) wants to merge 2 commits into
What
Continuation of #48734 by Michel Tricot (@michel-tricot), which was a first implementation of `source-huggingface-datasets`. The new implementation uses the `datasets` library, which is more efficient than using the dataset viewer's API.
In addition to this improvement, I added a new source, `source-huggingface-buckets`, that points to Hugging Face Buckets (they are simply S3-like buckets).
Finally, I added the corresponding destinations, `destination-huggingface-datasets` and `destination-huggingface-buckets`, to close the loop.
How
For datasets I used the `datasets` library, which is based on Arrow/Parquet, and for buckets I used the `huggingface_hub` library.
User Impact
This will let users read from and write to HF datasets and buckets.
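To illustrate the read path, here is a minimal sketch of streaming rows from a Hub dataset with the `datasets` library and wrapping them as Airbyte `RECORD` messages. It is not the code from this PR: the function names `to_airbyte_records` and `read_hf_dataset` and the `repo_id/split` stream naming are assumptions made for illustration only.

```python
import time
from typing import Any, Dict, Iterable, Iterator


def to_airbyte_records(
    rows: Iterable[Dict[str, Any]], stream: str
) -> Iterator[Dict[str, Any]]:
    """Wrap plain row dicts in Airbyte-protocol RECORD messages."""
    for row in rows:
        yield {
            "type": "RECORD",
            "record": {
                "stream": stream,
                "data": dict(row),
                # The Airbyte protocol expects epoch milliseconds here.
                "emitted_at": int(time.time() * 1000),
            },
        }


def read_hf_dataset(repo_id: str, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Stream rows from a Hub dataset instead of downloading it fully first."""
    # Imported lazily so the helper above stays usable without the dependency.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that yields one dict per row.
    dataset = load_dataset(repo_id, split=split, streaming=True)
    # Hypothetical stream naming; the actual connector may name streams differently.
    return to_airbyte_records(dataset, stream=f"{repo_id}/{split}")
```

Calling `read_hf_dataset` with a dataset repo id and serializing each yielded message as one JSON line would approximate what a source emits during a read.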
Can this PR be safely reverted and rolled back?
Disclaimer
The spec and metadata files are AI-generated. Do you have a pointer to some docs so I can review them? The main code is written by me. This is causing the CI to fail.